Feat/close preconfirmed on graceful shutdown #859

Mohiiit · 2025-11-19T11:07:03Z

Graceful Shutdown with Preconfirmed Block Closing

Summary

Implements graceful shutdown for the block production service that properly closes any open preconfirmed block during shutdown, using the executor's existing state without re-execution.

Problem

Previously, when graceful shutdown was triggered, the block production service would exit immediately, leaving any open preconfirmed block unclosed. This required re-execution of transactions on restart, which is inefficient and can cause inconsistencies.

Solution

The implementation adds a graceful shutdown flow that:

1. Detects cancellation before the batcher completes
2. Sends CloseBlock command to the executor thread (which is still running)
3. Waits for EndBlock message from the executor with the block's execution summary
4. Closes the block using the executor's existing state (no re-execution needed)
5. Handles edge cases with timeout protection and graceful degradation
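
The steps above can be sketched with plain threads and std channels. This is a minimal illustration and not the PR's code: the real service uses tokio, and `ExecutorCommand`, `EndBlock`, `spawn_executor` and `close_block_on_shutdown` as written here are hypothetical stand-ins.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Commands the main loop can send to the executor (illustrative names).
pub enum ExecutorCommand {
    /// Execute one more transaction in the open block.
    Execute(u64),
    /// Ask the executor to finalize the open block.
    CloseBlock,
}

/// Reply sent once the executor has finalized the block.
#[derive(Debug, PartialEq)]
pub struct EndBlock {
    /// Stand-in for the real execution summary.
    pub executed_txs: Vec<u64>,
}

/// Spawns a toy executor that accumulates state until told to close the block.
pub fn spawn_executor(
    commands: Receiver<ExecutorCommand>,
    replies: Sender<EndBlock>,
) -> JoinHandle<()> {
    thread::spawn(move || {
        let mut executed_txs = Vec::new();
        // The loop ends on CloseBlock or when the command channel is dropped.
        while let Ok(command) = commands.recv() {
            match command {
                ExecutorCommand::Execute(tx) => executed_txs.push(tx),
                ExecutorCommand::CloseBlock => {
                    // Reuse the state built so far; nothing is re-executed.
                    let _ = replies.send(EndBlock { executed_txs });
                    return;
                }
            }
        }
    })
}

/// Requests a close and waits for the summary, giving up after `timeout`.
pub fn close_block_on_shutdown(
    commands: &Sender<ExecutorCommand>,
    replies: &Receiver<EndBlock>,
    timeout: Duration,
) -> Option<EndBlock> {
    // If the executor is already gone, the block stays preconfirmed.
    commands.send(ExecutorCommand::CloseBlock).ok()?;
    match replies.recv_timeout(timeout) {
        Ok(end_block) => Some(end_block),
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => None,
    }
}
```

Returning `None` in the failure cases corresponds to the graceful-degradation step: the caller simply leaves the block preconfirmed.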

Implementation Details

Shutdown Flow

The shutdown process is driven by five concurrent branches of a tokio::select! loop:

- Cancellation Detection: Detects the shutdown request and attempts to close the block
- Batcher Completion: Handles batcher task completion and fallback block closing
- EndBlock Processing: Receives and processes the EndBlock message from the executor
- Timeout Protection: Prevents infinite waiting with a 5-second timeout
- Executor Shutdown: Handles executor thread completion

State Management

- ShutdownState: Encapsulates shutdown state (shutting_down, batcher_completed, end_block_deadline)
- try_close_block_on_shutdown(): Attempts to close block and returns ControlFlow for loop control
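
As a rough illustration of how these two pieces might fit together, here is a std-only sketch. The field names come from the description above, but the field types, the method signature, the `send_close_block` callback and the `END_BLOCK_TIMEOUT` constant are assumptions made for this example.

```rust
use std::ops::ControlFlow;
use std::time::{Duration, Instant};

/// How long to wait for EndBlock; the description mentions a 5-second timeout.
pub const END_BLOCK_TIMEOUT: Duration = Duration::from_secs(5);

/// Field names follow the PR description; the types are assumptions.
#[derive(Debug, Default)]
pub struct ShutdownState {
    pub shutting_down: bool,
    pub batcher_completed: bool,
    pub end_block_deadline: Option<Instant>,
}

impl ShutdownState {
    /// Hypothetical stand-in for `try_close_block_on_shutdown()`.
    ///
    /// `send_close_block` models sending `CloseBlock` to the executor and
    /// returns `false` when the executor is no longer listening.
    pub fn try_close_block_on_shutdown(
        &mut self,
        has_open_block: bool,
        now: Instant,
        send_close_block: impl FnOnce() -> bool,
    ) -> ControlFlow<()> {
        self.shutting_down = true;
        if has_open_block && send_close_block() {
            // Keep looping until EndBlock arrives or the deadline passes.
            self.end_block_deadline = Some(now + END_BLOCK_TIMEOUT);
            ControlFlow::Continue(())
        } else {
            // Nothing to wait for: leave the loop.
            ControlFlow::Break(())
        }
    }
}
```

Returning `ControlFlow` lets the caller write `if state.try_close_block_on_shutdown(..).is_break() { return Ok(()); }` inside the loop instead of threading extra booleans through it.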

Note

If graceful shutdown fails for any reason, the block remains preconfirmed and is handled on restart via the existing close_preconfirmed_block_if_exists() mechanism.

prkpndy · 2025-11-20T10:26:29Z

madara/crates/client/block_production/src/lib.rs

+    #[tokio::test]
+    async fn test_graceful_shutdown_closes_preconfirmed_block(
+        #[future]
+        #[with(Duration::from_secs(3000000000), false)]


this seems to be too big? should we not reduce it?
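
For a sense of scale, a quick conversion shows how large this value is. This is only a back-of-the-envelope sketch; `whole_years` and `SECS_PER_YEAR` are names made up for the example and are not part of the codebase.

```rust
use std::time::Duration;

/// Seconds in a 365-day year, used only for a rough conversion.
const SECS_PER_YEAR: u64 = 365 * 24 * 60 * 60;

/// Converts a duration to whole 365-day years, rounding down.
pub fn whole_years(duration: Duration) -> u64 {
    duration.as_secs() / SECS_PER_YEAR
}
```

`whole_years(Duration::from_secs(3000000000))` evaluates to 95, so whatever interval this fixture argument configures is effectively "never" within a 30-second test.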

prkpndy · 2025-11-20T10:28:21Z

madara/crates/client/block_production/src/lib.rs

+        task.await.unwrap();
+
+        // Give a small delay to ensure database writes complete
+        tokio::time::sleep(Duration::from_millis(100)).await;


why is this needed though? aren't all services (including the DB) closed once we shut down?

prkpndy · 2025-11-20T10:29:25Z

madara/crates/client/block_production/src/lib.rs

+        // Graceful shutdown flow:
+        //
+        // ┌─────────────────────────────────────────────────────────────────┐
+        // │                    Graceful Shutdown Flow                        │


Can we format this?

prkpndy · 2025-11-20T10:42:22Z

madara/crates/client/block_production/src/lib.rs

+    #[rstest::rstest]
+    #[timeout(Duration::from_secs(30))]
+    #[tokio::test]
+    async fn test_graceful_shutdown_closes_preconfirmed_block(


Is it possible to add some more test cases for edge cases?

prkpndy · 2025-11-20T10:59:12Z

madara/crates/client/block_production/src/lib.rs

+                    if let Some(deadline) = shutdown_state.end_block_deadline.as_ref() {
+                        tokio::time::sleep_until(*deadline).await
+                    } else {
+                        std::future::pending().await
+                    }


I think we can remove the if-else statement and directly sleep until deadline
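
One thing to keep in mind with this suggestion: end_block_deadline is optional, so some code path still has to decide what happens when no deadline is set. The std-only sketch below shows the same decision with a blocking channel; `recv_until` is a hypothetical helper, and in the tokio version the equivalent would be either the existing else branch or an `if` precondition on the select! branch.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Instant;

/// Waits for a message, honouring the deadline only when one is set.
pub fn recv_until<T>(
    rx: &Receiver<T>,
    deadline: Option<Instant>,
) -> Result<T, RecvTimeoutError> {
    match deadline {
        // A deadline exists: wait for the time that is left (zero if it already passed).
        Some(deadline) => rx.recv_timeout(deadline.saturating_duration_since(Instant::now())),
        // No deadline yet: block until a message arrives or all senders are dropped.
        None => rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
    }
}
```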

apoorvsadana

As discussed, let's try an approach where:

- we pass cancellation ctx to executor
- block production service waits for batcher, executor and for block to be closed
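
A minimal sketch of that shape, using an `AtomicBool` as a stand-in for the cancellation context and std threads instead of tokio tasks. All names here (`spawn_executor`, `shutdown`) are hypothetical, and the returned `Vec<u64>` stands in for the closed block.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Receiver};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawns a toy executor that checks a shared cancellation flag between batches.
/// On cancellation it closes its own block and returns the executed transactions.
pub fn spawn_executor(
    batches: Receiver<u64>,
    cancelled: Arc<AtomicBool>,
) -> JoinHandle<Vec<u64>> {
    thread::spawn(move || {
        let mut executed = Vec::new();
        loop {
            match batches.recv_timeout(Duration::from_millis(10)) {
                Ok(tx) => executed.push(tx),
                // No batch right now: check whether shutdown was requested.
                Err(mpsc::RecvTimeoutError::Timeout) => {
                    if cancelled.load(Ordering::SeqCst) {
                        break;
                    }
                }
                // The batcher dropped its sender: nothing more will arrive.
                Err(mpsc::RecvTimeoutError::Disconnected) => break,
            }
        }
        // The "closed block" handed back to the service.
        executed
    })
}

/// Service side: signal cancellation, then wait for the executor to finish.
pub fn shutdown(cancelled: &AtomicBool, executor: JoinHandle<Vec<u64>>) -> Vec<u64> {
    cancelled.store(true, Ordering::SeqCst);
    executor.join().expect("executor thread panicked")
}
```

Because the executor decides when to close the block, the service no longer needs a separate CloseBlock round trip; it only waits for the executor to finish.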

apoorvsadana · 2025-11-20T12:47:18Z

madara/crates/client/block_production/src/lib.rs

+                    // Executor channel is closed - executor has already shut down
+                    // This can happen if executor detected channel closure before we sent CloseBlock
+                    // In this case, the block will remain preconfirmed and be handled on restart


is this possible? if yes, we should fix it as a part of the PR. we shouldn't have a known race condition. if not, this comment probably needs to change.
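
For reference, this is what the failure looks like with a std channel: once the receiving side has been dropped, `send` returns an error instead of delivering the command. The sketch is generic and says nothing about whether the ordering can actually occur in this service; `CloseRequest` and `request_close` are hypothetical names.

```rust
use std::sync::mpsc;

/// Outcome of trying to request a block close (illustrative).
#[derive(Debug, PartialEq)]
pub enum CloseRequest {
    Sent,
    ExecutorGone,
}

/// Sends a close request and reports whether anyone was there to receive it.
pub fn request_close(commands: &mpsc::Sender<&'static str>) -> CloseRequest {
    match commands.send("CloseBlock") {
        Ok(()) => CloseRequest::Sent,
        // `send` only fails when the receiver has been dropped.
        Err(mpsc::SendError(_)) => CloseRequest::ExecutorGone,
    }
}
```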

apoorvsadana · 2025-11-20T12:53:07Z

madara/crates/client/block_production/src/lib.rs

+        // 1. Cancellation Signal (ctx.cancel_global())
+        //    │
+        //    ├─> Batcher detects cancellation → exits → closes send_batch channel
+        //    │
+        //    └─> Main loop detects cancellation (via ctx.cancelled())
+        //        │
+        //        ├─> Check: Is there an open preconfirmed block?
+        //        │   │
+        //        │   ├─> YES: Send CloseBlock command to executor
+        //        │   │   │
+        //        │   │   ├─> Executor receives CloseBlock → sets force_close = true
+        //        │   │   │   │
+        //        │   │   │   └─> On next iteration, executor checks force_close
+        //        │   │   │       │
+        //        │   │   │       └─> Calls finalize() → sends EndBlock message
+        //        │   │   │
+        //        │   │   └─> Main loop waits for EndBlock (with timeout)
+        //        │   │       │
+        //        │   │       ├─> EndBlock received → process_reply() closes block
+        //        │   │       │   └─> Return (shutdown complete)
+        //        │   │       │
+        //        │   │       └─> Timeout/Error → block remains preconfirmed (handled on restart)
+        //        │   │
+        //        │   └─> NO: Continue to wait for batcher completion
+        //        │
+        // 2. Batcher Completion (alternative path if cancellation not detected first)
+        //    │
+        //    └─> Check: Is there an open preconfirmed block?
+        //        │
+        //        └─> Same flow as above (send CloseBlock → wait for EndBlock)
+        //
+        // 3. Executor Shutdown
+        //    │
+        //    └─> Executor detects send_batch channel closure → exits gracefully
+        //        └─> Signals via executor.stop channel
+        //
+        // Key Safety Features:
+        // - Timeout protection: Won't wait indefinitely for EndBlock
+        // - State validation: Only sends CloseBlock when block exists
+        // - Graceful degradation: If executor already shut down, block handled on restart
+        // - No re-execution: Uses executor's existing state (no transaction re-execution needed)
+        //
+        // Note: The executor thread shuts down, dropping the `executor.stop` channel, therefore closing it as well.


i think these comments are more detailed than they need to be. i am worried they will get outdated very easily because someone will update the executor thread and forget to change this comment because it's in a completely different place. maybe this detail makes sense in the PR desc

apoorvsadana · 2025-11-20T13:02:38Z

madara/crates/client/block_production/src/lib.rs

+                //   - Executor crashes before sending EndBlock
+                //   - EndBlock gets lost in transit
+                //   - Any other unexpected failure
+                _ = async {


doesn't the services layer already have a deadline after which it forcefully kills the service?

apoorvsadana · 2025-11-20T13:10:45Z

madara/crates/client/block_production/src/lib.rs

+                //   4. If CloseBlock succeeds → wait for EndBlock
+                //   5. If CloseBlock fails or no block → shutdown complete
+                res = &mut batcher_task, if !shutdown_state.batcher_completed => {
+                    res.context("In batcher task")?;


if the batcher task closes because of an error, then ? will technically return before we gracefully shut down?

comment still applies?
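
One way to avoid the early return, sketched with plain `Result<(), String>` values: hold on to the batcher error, run the close attempt regardless, and only then report the error. `finish_batcher` and the `close_block` callback are hypothetical; the real code uses its own error type with `.context(..)`.

```rust
/// Runs the close attempt even when the batcher failed, then reports the batcher error.
pub fn finish_batcher(
    batcher_result: Result<(), String>,
    close_block: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    // Hold on to the batcher error instead of returning it with `?` right away.
    let batcher_error = batcher_result
        .err()
        .map(|e| format!("In batcher task: {e}"));
    // The close attempt runs on both the success and the failure path.
    let close_result = close_block();
    match batcher_error {
        Some(error) => Err(error),
        None => close_result,
    }
}
```

In this sketch the batcher error takes precedence over a close error; whether that is the right priority is a design choice for the PR.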

apoorvsadana · 2025-11-21T13:18:45Z

madara/crates/client/block_production/src/lib.rs

+                //   4. If CloseBlock succeeds → wait for EndBlock
+                //   5. If CloseBlock fails or no block → shutdown complete
+                res = &mut batcher_task, if !shutdown_state.batcher_completed => {
+                    res.context("In batcher task")?;


comment still applies?

apoorvsadana · 2025-11-21T13:20:49Z

madara/crates/client/block_production/src/lib.rs

                    self.process_reply(reply).await.context("Processing reply from executor thread")?;
+
+                    // If we're shutting down and just processed EndBlock, shutdown is complete
+                    if shutting_down && is_end_block {


it's possible batcher hasn't closed yet right? imagine a scenario where batcher didn't detect signal yet, shutting_down has been made true and executor closed block because block time was reached.
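
To make the scenario concrete, a completion check could require that the batcher has finished and that no block is open, instead of relying on "EndBlock seen while shutting down" alone. This is a hypothetical sketch, not the PR's logic; `ShutdownProgress` and its fields are names chosen for the example.

```rust
/// Flags the main loop would track while shutting down (illustrative names).
#[derive(Debug, Clone, Copy, Default)]
pub struct ShutdownProgress {
    pub shutting_down: bool,
    pub batcher_completed: bool,
    pub block_open: bool,
}

impl ShutdownProgress {
    /// Shutdown is complete only once the batcher can no longer feed the executor
    /// and there is no open block left to close.
    pub fn is_complete(&self) -> bool {
        self.shutting_down && self.batcher_completed && !self.block_open
    }
}
```

In the scenario above (shutting_down set, block closed by block time, batcher still running) `is_complete()` stays false until the batcher has exited.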

apoorvsadana · 2025-11-21T13:29:52Z

madara/crates/client/block_production/src/lib.rs

+                    if shutting_down {
+                        // Executor exited during shutdown
+                        // If there was a block, executor should have sent EndBlock before exiting
+                        // If we're here, either no block existed or executor panicked
+                        // In case of panic, it will propagate naturally
+                        tracing::debug!("Executor shut down during graceful shutdown");
+                        return Ok(());
+                    }
+                    // Normal executor completion (not during shutdown)
+                    // If executor panicked, recv() will resume the panic (handled by StopErrorReceiver)
+                    res.context("In executor thread")?;
+                    return Ok(());


can't we just check if res is an error, if yes we add context, otherwise we don't? or maybe we add context all the time. unsure what this if condition is solving here.
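
Attaching context unconditionally is cheap because it only touches the error case, much like `map_err`. A std-only sketch of that behaviour, where `WithContext` and `executor_result` are hypothetical stand-ins for the real error type and call site:

```rust
/// A minimal error wrapper carrying a context label (illustrative).
#[derive(Debug, PartialEq)]
pub struct WithContext {
    pub context: &'static str,
    pub cause: String,
}

/// Adds context to the executor result; `Ok` passes through unchanged.
pub fn executor_result(res: Result<(), String>) -> Result<(), WithContext> {
    res.map_err(|cause| WithContext {
        context: "In executor thread",
        cause,
    })
}
```

With this shape there is no need to branch on shutting_down before adding context: a clean exit stays `Ok(())` in both cases.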

apoorvsadana · 2025-11-21T13:32:44Z

madara/crates/client/block_production/src/lib.rs

+    #[rstest::rstest]
+    #[timeout(Duration::from_secs(30))]
+    #[tokio::test]
+    async fn test_graceful_shutdown_closes_preconfirmed_block_with_multiple_transactions(


let's combine this test and the 1 transaction test?
i think having multiple cases like these makes more sense when we have smaller unit tests, not in these ones (or the ones we had in cairo native)

apoorvsadana · 2025-11-21T13:35:44Z

madara/crates/client/block_production/src/lib.rs

+    // This test verifies that graceful shutdown completes successfully even in edge cases.
+    // With the new implementation, the executor automatically closes blocks when it detects
+    // the send_batch channel closure (WaitTxBatchOutcome::Exit). This test ensures shutdown
+    // completes gracefully regardless of timing.
+    //
+    // The scenario: Cancellation → batcher completes → closes send_batch → executor detects Exit
+    // → executor closes block automatically → sends EndBlock → main loop closes block → shutdown complete
+    #[rstest::rstest]
+    #[timeout(Duration::from_secs(30))]
+    #[tokio::test]
+    async fn test_graceful_shutdown_closeblock_fails_executor_channel_closed(
+        #[future]


what's the edge case here? not sure how this is different from the previous test case?

Mohiiit added 6 commits November 11, 2025 13:09

orch test: parallelization done

48296d4

Merge branch 'main' of https://github.com/madara-alliance/madara

61a992c

Merge branch 'main' of https://github.com/madara-alliance/madara

5a07f94

chore: main merged

ee69832

Merge branch 'main' of https://github.com/madara-alliance/madara

8db336c

feat: graceful shutdown closes block now

89db96c

github-project-automation bot added this to Madara Nov 19, 2025

Mohiiit self-assigned this Nov 19, 2025

Mohiiit added the madara label Nov 19, 2025

Mohiiit and others added 4 commits November 19, 2025 16:39

chore: orch related BT

be16e13

chore: revert orch related stuff

a9a52bb

Merge branch 'main' into feat/close-preconfirmed-on-graceful-shutdown

ec2a7d6

Merge branch 'main' into feat/close-preconfirmed-on-graceful-shutdown

a1ddbcc

prkpndy approved these changes Nov 20, 2025

apoorvsadana reviewed Nov 20, 2025

Mohiiit and others added 4 commits November 21, 2025 10:08

feat(block-prod): tests added + logic simplified

6eff754

chore: main merged

e259a28

Merge branch 'main' into feat/close-preconfirmed-on-graceful-shutdown

4a39cce

chore: main merged

3b2d5cf

apoorvsadana reviewed Nov 21, 2025

Mohiiit and others added 5 commits November 22, 2025 15:13

fix: batcher completion logic

aa0bc73

Merge branch 'main' into feat/close-preconfirmed-on-graceful-shutdown

85cea0d

chore: docs updated

bb54d6d

fix: block production graceful shutdown simplified

c783ab8

feat: handling graceful block closing with separate message

bf7f09a

4 participants